Introduction to Statistics

Bennett Kleinberg

Week 3

  • z-score
  • basic probability
  • binomial distribution

The core idea

Sampling

  • we draw samples from the population
  • ideally: we do this at random (i.e. every member of the population has the same chance to be in our sample)
  • the data (obtained from the sample) can be represented in various forms
    • summary statistic
    • raw data
    • through the data collection procedure

It is useful to think of data as a distribution.

Example

  • Suppose we ask 1,000 students about their intro to statistics grade
  • We can look at the raw data

We can obtain summary statistics:

  • \(M = 7.02\)
  • \(SD = 0.99\)
  • \(var = 0.98\)

Histogram

Density

Looking at distributions

  • we can get an idea of the spread of the data
  • we can understand the kind of distribution (more in Week 4)
  • we can get an idea of the probability of a certain value \(X\)

What we know

Locations in the distribution

Where does a value of \(X = 5.5\) lie in the distribution?

  • But what if I want to compare the locations?

Suppose we have another sample…

Another sample

Locations in the distribution

  • So we want a statistic (a score) that gives us the locations
  • … in a comparable sense
  • … relative to its distribution

We can take the mean as our orientation.

… then we can say how close/far a value is from the mean

Enter: the z-score

Idea: we locate a point relative to the mean in terms of SDs away.

\(z = \frac{X - \mu}{\sigma}\)

Suppose: \(\mu = 7\) and \(\sigma = 1\)

For our value of 5.5:

\(z = \frac{X - \mu}{\sigma} = \frac{5.5 - 7}{1} = -1.5\)

A value of 5.5 in our data has a z-score of -1.50.

z-scores

Comparing the two

z-scores of our examples

Red distribution: \(X \sim N(\mu, \sigma)\) –> \(X \sim N(7.00, 1.00)\)

  • A grade of 8.0
  • \(z = \frac{X - \mu}{\sigma} = \frac{8.0-7.0}{1} = 1\)
  • A grade of 8.0 has a z-score of 1.00 (i.e. it is one standard deviation above the mean)

z-scores of our examples

Blue distribution: \(X \sim N(\mu, \sigma)\) –> \(X \sim N(7.00, 0.50)\)

  • A grade of 8.0
  • \(z = \frac{X - \mu}{\sigma} = \frac{8.0-7.0}{0.5} = 2\)
  • A grade of 8.0 has a z-score of 2.00 (i.e. it is two standard deviations above the mean)
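
The z-score calculations above can be sketched in a few lines of Python. This is only an illustration: the function name `z_score` is an assumption made for this example and is not part of the course material.

```python
def z_score(x, mu, sigma):
    """Return how many standard deviations x lies away from the mean mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


# Grade of 5.5 in the distribution with mu = 7 and sigma = 1
print(z_score(5.5, 7.0, 1.0))  # -1.5

# The same grade of 8.0 in the red (sigma = 1) and blue (sigma = 0.5) distributions
print(z_score(8.0, 7.0, 1.0))  # 1.0
print(z_score(8.0, 7.0, 0.5))  # 2.0
```

The same raw grade ends up at a different location once the spread of its distribution is taken into account.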

This is really useful

  • We can now work with standardised values
  • We can also standardise whole distributions

z-transformations

Same score, find z

Same z, find scores

We can also “re-scale” distributions now

Say we wanted to project these data to a new distribution:

  • with \(M = 100\) (original: \(M=7.0\))
  • and \(SD=10\) (original: \(SD=1.0\))

id   value   z-score   new value
1    6.0     -1.0      90
2    4.5     -2.5      75
3    9.5      2.5      125
4    7.5      0.5      105
5    5.5     -1.5      85
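
A minimal sketch of this re-scaling, assuming a hypothetical helper called `rescale`: each value is first turned into a z-score in the old distribution and then placed at the same z-score in the new one.

```python
def rescale(x, old_mean, old_sd, new_mean, new_sd):
    """Map x to the value with the same z-score in the new distribution."""
    z = (x - old_mean) / old_sd      # step 1: standardise
    return new_mean + z * new_sd     # step 2: project onto the new scale


values = [6.0, 4.5, 9.5, 7.5, 5.5]
new_values = [rescale(v, 7.0, 1.0, 100.0, 10.0) for v in values]
print(new_values)  # [90.0, 75.0, 125.0, 105.0, 85.0]
```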

z, sigma, and mu

We can solve the z-score formula for any one of \(z\), \(X\), \(\mu\) and \(\sigma\) when the other three are known:

\(z = \frac{X - \mu}{\sigma}\), i.e.

\(X = \mu + z\sigma\), and

\(\mu = X - z\sigma\), and

\(\sigma = \frac{X-\mu}{z}\) (for \(z \neq 0\))

The generalisation

When we “standardise” the distribution, how does it affect the mean \(\mu\) and standard deviation \(\sigma\)?

  • the mean: becomes 0.00
  • the SD: becomes 1.00

Take this population with \(\mu=3\) and \(\sigma=\sqrt{2} \approx 1.41\)

id   value   z
1    1       -1.41
2    2       -0.71
3    3        0.00
4    4        0.71
5    5        1.41

This results in:

\(\mu = \frac{-1.41-0.71+0.00+0.71+1.41}{5} = \frac{0}{5} = 0\)

\(\sigma^2 = \frac{SS}{N} = \frac{(-1.41)^2+(-0.71)^2+(0.00)^2+(0.71)^2+(1.41)^2}{5} \approx \frac{5}{5} = 1\)
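
A quick numerical check of this property. The sketch below computes the population mean and SD directly from the five values 1 to 5 (dividing by \(N\), as in the formula above) and then standardises them; the variable names are chosen for this example only.

```python
import math

values = [1, 2, 3, 4, 5]
n = len(values)

# Population mean and population SD (divide by N, not N - 1)
mu = sum(values) / n
sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)

# Standardise every value
z = [(v - mu) / sigma for v in values]

# Mean and SD of the standardised values
z_mean = sum(z) / n
z_sd = math.sqrt(sum((s - z_mean) ** 2 for s in z) / n)

print(round(mu, 2), round(sigma, 2))      # mean and SD of the raw values
print(round(z_mean, 2), round(z_sd, 2))   # mean and SD after standardising
```

Whatever the original mean and SD were, the standardised values end up with a mean of 0 and an SD of 1.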

Would you trust me?

Suppose …

Basic probability

Simplest form:

  • Probability of something = \(\frac{something}{everything}\)
  • \(P(A) = \frac{\# A}{\# \ possible\ outcomes}\)

Requires random sampling (see page 163)!

Examples

  1. \(P(captain) = \frac{1}{11} \approx 0.09 = 9\%\)
  2. \(P(birthday) = \frac{1}{365} \approx 0.0027 = 0.27\%\)
  3. \(P(correct\ guess) = \frac{1}{4} = 0.25 = 25\%\)

Back to our problem

We know the guessing probability: 0.50 (or 50%).

  • So for every prediction it’s \(P(correct)=0.50\)
  • The predictions are independent, so we need to multiply them:

\(P(1st\ correct\ and\ 2nd\ correct)=0.50*0.50 = 0.25\)

Thus for 10 correct predictions:

\(P(correct)*P(correct)*P(correct)*...\) –> \(P(correct)^{10}\)

\(0.50^{10} = 0.0009765625\), or 1/1024

A great scam!

Birthday example

What is the probability that two people have the same birthday in a class of 10/25/50 students?

We’ll solve this stepwise in the live session
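
One possible way to compute this is sketched below. It rests on assumptions that are not stated on the slide: the question is read as "at least two students share a birthday", a year has 365 equally likely birthdays, and birthdays are independent. The function name `p_shared_birthday` is made up for this example. The trick is to compute the probability that all birthdays differ and subtract it from 1.

```python
def p_shared_birthday(n_people, n_days=365):
    """Probability that at least two of n_people share a birthday.

    Assumes n_days equally likely, independent birthdays.
    """
    p_all_different = 1.0
    for i in range(n_people):
        # Person i must avoid the i birthdays that are already taken
        p_all_different *= (n_days - i) / n_days
    return 1.0 - p_all_different


for n in (10, 25, 50):
    print(n, round(p_shared_birthday(n), 3))
```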

Remember Maria?

Maria is 26 years old, single, outspoken, and very bright. She majored in law. As a student, she was deeply concerned with issues of discrimination and miscarriage of justice and participated in weekly animal-rights demonstrations.

Which is more probable?

  • A: Maria works in a law firm
  • B: Maria works in a law firm and does pro bono work for animal-rights activists

Joint probability

Formalising the problem:

  • \(P(A)\) (Maria works in a law firm)
  • \(P(B)\) (Maria works in a law firm and does pro bono work for animal-rights activists)

Why is \(P(B) < P(A)\)?

\(B\) –> \(A\) + does pro bono work for animal-rights activists

Let \(C\) be the event “does pro bono work for animal-rights activists”, with probability \(P(C)\)

Joint probability

Two events occurring together can never be more probable than either event on its own. If the events are independent, the joint probability is the product of the individual probabilities.

So \(P(B) = P(A \cap C) = P(A)*P(C)\)

Suppose:

  • \(P(A) = 0.6\)
  • \(P(C) = 0.7\)
  • \(P(B) = P(A)*P(C) = 0.6*0.7 = 0.42\)

Screening example

Conditional probability

What we are after is: probability of TERRORIST given that there is an ALARM

In probability notation this is expressed as: \(P(T \mid A)\)

Solving the problem

Actual status   Alarm    No alarm   Total
Terrorist       950      50         1,000
Passenger       4,950    94,050     99,000
Total           5,900    94,100     100,000

\(P(terrorist \mid alarm) = 950/5900 = 16.10\%\)
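
The same calculation as a small sketch. It assumes the rows of the table are the actual status (terrorist or passenger) and the columns are the screening result (alarm or no alarm); the dictionary keys are labels chosen for this example.

```python
# Counts from the table: (actual status, screening result) -> number of people
counts = {
    ("terrorist", "alarm"): 950,
    ("terrorist", "no_alarm"): 50,
    ("passenger", "alarm"): 4950,
    ("passenger", "no_alarm"): 94050,
}

# Conditioning on ALARM: only look at people who triggered the alarm
n_alarm = counts[("terrorist", "alarm")] + counts[("passenger", "alarm")]

p_terrorist_given_alarm = counts[("terrorist", "alarm")] / n_alarm
print(round(p_terrorist_given_alarm, 4))  # 0.161
```

Conditioning shrinks the denominator from all 100,000 people to the 5,900 who set off the alarm.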

The normal distribution

  • most important distribution in statistics
  • also called: Gaussian distribution, bell-shaped distribution
  • symmetrical shape

Defined by two parameters:

  • \(\mu\) and \(\sigma\) expressed as \(X \sim N(\mu, \sigma)\)
  • special case: \(X \sim N(0, 1)\) (The standard normal)

Note: a normal distribution is always bell-shaped, but not every bell-shaped distribution is a normal distribution.

Standard normal

Wider

Narrower

Different mean

The normal distribution

We can locate each y-value.

Each x-value corresponds to a height of the curve (a density) through the probability density function (PDF):

\(Y = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(X-\mu)^2}{2\sigma^2}}\)

e.g. for \(X = 3\) in \(N(0,1)\)

\(Y = \frac{1}{\sqrt{2\pi}}e^{-\frac{3^2}{2}} = \frac{1}{2.51}e^{-4.5} = \frac{1}{2.51}*0.0111 = 0.0044\)

i.e. the density at \(X=3\) under the standard normal is ~0.0044. Note: this is the height of the curve, not the probability of observing exactly 3; probabilities correspond to areas under the curve.
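
A minimal sketch that evaluates the PDF formula directly, without intermediate rounding. The function name `normal_pdf` is an assumption for this example; the returned number is the height of the curve at \(X\).

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve N(mu, sigma) at x."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


print(round(normal_pdf(3.0), 4))  # 0.0044
print(round(normal_pdf(0.0), 4))  # 0.3989 (the peak of the standard normal)
```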

Probability density function (PDF)

We can apply the PDF and obtain the exact shape of the normal distribution.

But we don’t have to do this

There is a nice relationship between the distribution and z-scores.

And we can describe the portions of the function in terms of z-scores.

Whole area = 1

Probability from the shape

  • We know that the whole area (note: the curve is asymptotic to both sides) covers all values
    • i.e. the area under the curve here has to be 1
  • So we also know that half of the area equals 0.50
  • And 1/3 of the area equals 33.33%
  • etc.

Half the area equals 50%

z-scores

We can calculate the area covered between two x-values.

We don’t need to because we know how these areas relate to z-scores:

Area between \(z = -1\) and \(z = +1\) = 68.26%

Area between \(z = -2\) and \(z = +2\) = 95.44%

The unit table

  • The information re. z-score and area can be found in the unit table
  • Full table in Appendix B (p. 647-650)

z      Prop in body   Prop in tail   Prop between M and z
1.00   0.8413         0.1587         0.3413
1.96   0.9750         0.0250         0.4750

Proportion in body (z=1.00)

Proportion in tail (z=1.00)

Proportion between M and z

Our ingredients

  • We know which body/tail/M-z probabilities correspond to a z-score
  • So we can calculate how probable certain values are:

For a standard normal, how likely is it to obtain a value of \(X=0.5\)?

Note: what we really need to ask about is areas: how likely is it to obtain a value of at most 0.5?

Our target area

Using the unit table

How likely is it to obtain a value of at most 0.5?

z      Prop in body   Prop in tail   Prop between M and z
0.50   0.6915         0.3085         0.1915

The green area corresponds to the proportion in the body = 69.15%.

A value of at most 0.5 (i.e. 0.5 or lower) has a probability of 69.15%.

Different question, different area

How likely is it to obtain a value of at least 0.5?

Note: this means “0.5 or higher”

New target area

Using the unit table

How likely is it to obtain a value of at least 0.5?

z      Prop in body   Prop in tail   Prop between M and z
0.50   0.6915         0.3085         0.1915

The green area corresponds to the proportion in the tail = 30.85%.

A value of at least 0.5 (i.e. 0.5 or higher) has a probability of 30.85%.
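
The unit-table values can also be reproduced numerically. The sketch below is one way to do it, not the table's own method: it uses the standard library's error function `math.erf` to obtain the area to the left of a z-score, and the name `normal_cdf` is chosen for this example.

```python
import math


def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


z = 0.5
at_most = normal_cdf(z)          # "0.5 or lower": proportion in the body
at_least = 1.0 - at_most         # "0.5 or higher": proportion in the tail
between_mean_and_z = at_most - 0.5

print(round(at_most, 4))             # 0.6915
print(round(at_least, 4))            # 0.3085
print(round(between_mean_and_z, 4))  # 0.1915
```

Body and tail always add up to 1, because together they cover the whole area under the curve.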

Full example

  • From a normal distribution
  • to a z-score
  • to probabilities

How likely is it to be taller than 1.90m?

We’ll do this in-depth in the live session.

Questions we can answer

  • How likely is an IQ score of 130?
  • How many people have an IQ score between 99 and 101?
  • And how many a score between 150 and 151?

More in the live session

The binomial distribution

  • Some variables are categorical
  • e.g. they take one of a few possible values (green, blue, brown)
  • or flipping a coin (heads vs tails)

When there are exactly two possible outcomes, we call these data binomial data.

And the corresponding distribution the binomial distribution.

Binomial data

2 possible outcomes A and B.

  • \(P(A) = p\)
  • \(P(B) = q\)

Because we have only two outcomes, \(P(A) + P(B) = 1\), so

  • \(p + q = 1\)
  • \(q = 1-p\)

Take our guessing example!

50/50 chance

  • \(P(A) = p = 0.50\)
  • \(P(B) = q = 1-p = 0.50\)

Let \(p\) denote the probability of a correct guess.

So if I guess once: \(p = 0.50\)

  • Two outcomes: correct or incorrect.

Making more guesses

2 guesses: now we have four outcomes

  • correct correct
  • correct incorrect
  • incorrect correct
  • incorrect incorrect

So we can count:

  • having both correct: 1/4 = 0.25
  • having both incorrect: 1/4 = 0.25
  • having one correct (and one incorrect): 2/4 = 0.50
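
The counting above can be sketched by listing every possible sequence of guesses. This works as a simple count only because \(p = q = 0.50\), so all sequences are equally likely; the variable names are chosen for this example.

```python
from collections import Counter
from itertools import product

n_guesses = 2

# Every possible sequence of guesses, e.g. ("correct", "incorrect")
outcomes = list(product(["correct", "incorrect"], repeat=n_guesses))

# Count how many sequences contain 0, 1, 2 correct guesses
n_correct = Counter(outcome.count("correct") for outcome in outcomes)

probabilities = {k: n_correct[k] / len(outcomes) for k in sorted(n_correct)}
print(len(outcomes))   # 4
print(probabilities)   # {0: 0.25, 1: 0.5, 2: 0.25}
```

Changing `n_guesses` to 10 lists all 1,024 sequences in the same way.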

Let’s do our mini study

  • I guess 10 times
  • and guess correctly 2 times

Is this expected?

How (un)likely is that?

What if we did this, say, 1,000 times…

This looks very “normal”

A million times
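
A rough sketch of such a repeated mini study, using Python's built-in random number generator. The function name `simulate_studies`, the fixed seed, and the text histogram are choices made for this example, not the method behind the original plots.

```python
import random
from collections import Counter


def simulate_studies(n_studies, n_guesses=10, p_correct=0.5, seed=1):
    """Repeat the mini study n_studies times; count correct guesses per study."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    results = Counter()
    for _ in range(n_studies):
        n_right = sum(rng.random() < p_correct for _ in range(n_guesses))
        results[n_right] += 1
    return results


results = simulate_studies(1000)

# Crude text histogram: one '#' per 5 studies
for k in range(11):
    print(f"{k:2d} {'#' * (results[k] // 5)}")
```

Raising `n_studies` to 1,000,000 takes longer to run but makes the shape smoother.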

Properties of the binomial

Described formally by two parameters:

\(X \sim B(n, p)\)

  • \(n\) = no. of trials (e.g. 10)
  • \(p\) = prob. of success

Note: when \(n=1\), the binomial distribution is called the Bernoulli distribution.

The binomial approaches the normal

Approaches normal with increasing \(n\). Then:

\(\mu = pn\), and

\(\sigma = \sqrt{npq}\)

We can then also go back to z!

\(z = \frac{X-\mu}{\sigma} = \frac{X-pn}{\sqrt{npq}}\)

Precise probability for X

  • I guess 10 times
  • and guess correctly 2 times (\(X=2\))

We know:

  • \(n=10\)
  • \(p = q = 0.5\)

So:

\(\mu = pn = 0.5*10 = 5\)

\(\sigma = \sqrt{npq} = \sqrt{10*0.5*0.5} = \sqrt{2.5} = 1.58\)

To a z-score

\(X=2\)

\(z = \frac{X-\mu}{\sigma} = \frac{X-pn}{\sqrt{npq}} = \frac{2-5}{1.58} = -1.90\)

At least 2: looking at the body prob. in the unit table: 0.9713 (97.13%)

At most 2: looking at the tail prob. in the unit table: 0.0287 (2.87%)

\(X=10\)

\(z = \frac{X-\mu}{\sigma} = \frac{X-pn}{\sqrt{npq}} = \frac{10-5}{1.58} = 3.16\)

Looking at the tail prob. in the unit table: 0.0008 (0.08%)

After rounding, this is roughly 1/1024 (about 0.001)!
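
For comparison, the sketch below puts the normal approximation used above next to the exact binomial probabilities. The exact values come from the standard binomial formula \(P(X=k) = \binom{n}{k}p^kq^{n-k}\), which is not introduced on these slides; the function names are assumptions for this example.

```python
import math


def binom_pmf(k, n, p):
    """Exact probability of exactly k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


n, p = 10, 0.5
mu = n * p                          # 5
sigma = math.sqrt(n * p * (1 - p))  # about 1.58

# At most 2 correct guesses
exact_at_most_2 = sum(binom_pmf(k, n, p) for k in range(0, 3))
approx_at_most_2 = normal_cdf((2 - mu) / sigma)

# All 10 correct
exact_all_10 = binom_pmf(10, n, p)
approx_at_least_10 = 1.0 - normal_cdf((10 - mu) / sigma)

print(round(exact_at_most_2, 4), round(approx_at_most_2, 4))
print(round(exact_all_10, 4), round(approx_at_least_10, 4))
```

With only \(n = 10\) trials the approximation is rough, especially for "at most 2"; it improves as \(n\) grows.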

Remember the core aim?

Recap

  • z-scores
    • why they are useful
    • how to obtain them
    • how to locate points
  • probability
    • areas of distributions as probabilities
    • the normal distribution
    • the binomial distribution

Next week

  • sampling and distributions
  • hypothesis testing
  • confidence intervals